Abstract Pharmacogenomics requires the integration and analysis of genomic,
molecular, cellular, and clinical data, and it thus offers a remarkable set of challenges to 
biomedical informatics. These include infrastructural challenges such as the creation 
of data models and databases for storing these data, the integration of these data with 
external databases, the extraction of information from natural language text, and the 
protection of databases with sensitive information. There are also scientific challenges 
in creating tools to support gene expression analysis, three-dimensional structural anal- 
ysis, and comparative genomic analysis. In this review, we summarize the current uses 
of informatics within pharmacogenomics and show how the technical challenges that 
remain for biomedical informatics are typical of those that will be confronted in the 
postgenomic era. 


WHAT IS BIOMEDICAL INFORMATICS? 


Biomedical informatics is the study of information flow within biology and medi- 
cine. The use of computational techniques in biomedical research dates back to the 
first general-purpose computers, but interest in the techniques has exploded in the
last decade (1). The increased interest stems from the availability of experimental 
techniques that create data that simply cannot be manually analyzed and require 
computational intervention. Many areas of biology and medicine are being revo- 
lutionized by the introduction of new experimental techniques, accompanied by 
informatics methodologies that fundamentally change the way that investigators 
do their work. 

The two flows of information that are studied by informatics are the flow of 
information from the DNA code to biological function and the flow of information 
in the design and analysis of experiments. In the first flow, we are interested in the 
transfer of information within biology, while in the other, we are interested in the 
transfer of information about biology. Thus, the first information flow deals with 
the central dogma of biology: DNA is transcribed into RNA, RNA is translated into
protein, and protein molecules have functions that carry out biological processes. 
Interacting proteins produce signaling and metabolic pathways that coalesce to 
form networks at the cellular level, and cells interact at an organismal level to 
produce physiology. Informatics approaches to studying different aspects of this 
flow, therefore, include methods for gene finding (2-5), 3D structure prediction 
(6,7), modeling of genetic networks (8-14), and statistical population biology 
(15, 16). 

In the second flow, we are interested in the ways in which biological and medi- 
cal information is gathered. This flow begins with a scientific hypothesis, followed 
by a plan to collect data, execution of an experiment, analysis of the results, and 
subsequent refinement of the hypothesis. Informatics applications within this flow 
are usually created to support investigators in the practice of science. Informat- 
ics approaches to studying this flow, therefore, include methods for organizing 
and searching databases of literature, sequence, and function, as well as meth- 
ods for helping to create and evaluate scientific models (17). If both of these 
information flows are included in a definition of biomedical informatics, then vir- 
tually all biomedical informatics research can be placed in one or both of these 
areas. 

Biomedical informatics has gained prominence recently because biologists can 
now collect more data. The success of the genome sequencing projects has cat- 
alyzed a new way of thinking in biology, whereby data are collected on a large scale 
and without a particular hypothesis in mind. The data are then placed in a database, 
and scientists with hypotheses can extract information from the database in order 
to evaluate the merits of the hypotheses. This leads to a fundamental change in 
how some investigators do their work: Instead of first moving to the laboratory, 
they first move to the database, and only after assessment of the available data 
are experiments planned. There has been much debate about the merits of such an 
approach, but there is no doubt that the emergence of these large-scale and high- 
throughput methods for data collection makes such an approach feasible (18). The 
data explosion is not limited to DNA sequencing, and we are seeing increased ca- 
pacity to assess the levels of mRNA expression (19, 20), to detect protein-protein 
interactions (21), to locate gene products within the cell (22), to detect and iden- 
tify compounds using mass spectroscopy (23), and even to understand the detailed 
atomic three-dimensional structure of macromolecules and their small molecule 
ligands (24). 

As long as clever experimentalists continue to create these high-throughput 
experimental methods, informatics professionals will have a surfeit of data and 
data analytic challenges. Success in informatics usually means accelerated
understanding of the processes of interest and increased access to the information
required to generate and test scientific hypotheses. One of the areas that has re- 
cently attracted the attention of biomedical informaticians is pharmacology, and 
particularly pharmacogenetics and pharmacogenomics. 


WHAT ARE PHARMACOGENETICS AND
PHARMACOGENOMICS? 


Pharmacogenetics is the study of how variation in genes affects the response to 
drugs. It has existed as a field for more than four decades, and forms the basic 
intellectual framework for understanding phenomena such as the idiosyncratic re- 
sponses to anesthesia, to opiates, and to anticancer agents (25). Pharmacogenetics 
has tended to study single genes with a focused, hypothesis-directed set of experi- 
ments. Most pharmacogenetic studies begin with the recognition of high variability 
in response to a medication and then a search for the genetic basis of the variation. 
Thus, for example, the blood levels of a metabolite may be measured and noted 
to vary widely, investigations of the pathway of metabolism may suggest that en- 
zymes in this pathway are behaving differently, and the genetic analysis of these 
enzymes might detect variations in the protein sequence (or regulatory sequence) 
that explain different catalytic rates or binding constants. 

Pharmacogenomics emerged recently, and scientists do not entirely agree on its 
relationship to pharmacogenetics (26-29). “Genomics” is generally used to indi- 
cate the study of the entire complement of genes within an organism, and the -omics 
suffix has been used generally to indicate the comprehensive analysis of the capa- 
bilities of an organism. Thus, proteomics studies the full set of proteins and how 
they interact within a cell. Based on this understanding, pharmacogenomics can 
be construed as the study of the entire complement of pharmacologically relevant 
genes, how they manifest their variations, how these variations interact to produce 
phenotypes, and how these phenotypes affect drug response. A key element of 
pharmacogenomics is, not surprisingly, the large-scale and high-throughput col- 
lection of data, including DNA sequence variations, mRNA expression analysis, 
enzyme kinetic assays, and cellular localization experiments. The move toward 
these types of experiments of course creates a magnet for biomedical informatics 
investigators, who see an opportunity to apply their methodologies to an exciting 
area with promise to revolutionize medical care. 

The development of pharmacogenomics is a natural sequela to the success of the 
initial human genome sequencing project. The promise of that project was that an 
understanding of all human genes would create the opportunity for new diagnostic, 
prognostic, and therapeutic technologies. The variation in response to medications 
across patients can be large, and the occurrence of side effects and adverse events 
limits the success of many therapeutic strategies. A systematic understanding of 
the gene systems that modulate response to medications may therefore change 
the way medications are prescribed. With the success of pharmacogenomics, it 
may become possible routinely to check the genetic background of a patient in 
order to ensure that the prescribed medications are effective and free from adverse 
side effects (30-33). Although it is not entirely clear how many of the 35,000 
genes assigned in the rough draft of the human genome are relevant to drug response
(or even how to define “relevance”), a systematic analysis of pharmacology
textbooks indicates a core set of 500 to 1000 genes, as shown in the PharmGKB
Web site.¹


WHAT ARE THE APPROACHES TO 
PHARMACOGENOMICS? 


There are generally two approaches to pharmacogenomic research, which are 
summarized as the “genotype-to-phenotype” and the “phenotype-to-genotype” 
approaches. In the genotype-to-phenotype approach, the investigators start with a 
set of genes that are known (or strongly suspected) to be important in modulating 
the response to drugs, and then they search for variation in their sequences (that 
is, their genotype). Given an understanding of genetic variation, they can search 
for the phenotypic consequences. Examples of approaches amenable to genotype- 
to-phenotype analysis might include gene families known to be important for 
pharmacokinetics (the study of how medications are absorbed, distributed, and 
cleared from the body), such as phase I metabolism enzymes (the mixed function 
oxygenases of the cytochrome P450 system) (34), phase II metabolism enzymes
(the conjugation system) (35), and membrane transporter molecules (36). Other 
systems amenable to the genotype-to-phenotype approach are those that are in- 
volved in pharmacodynamics (the study of how medications have their therapeutic 
effect) and those whose mechanisms are well understood at the receptor and path- 
way level. Examples might include the well described pathways of inflammation 
in asthma (37, 38), the purine/pyrimidine biosynthetic pathways that are targeted 
by some anticancer agents (39), or the enzyme cascade that controls blood clotting 
(40). The steps of a genotype-to-phenotype approach can be summarized in this 
simplified way: 


1. Identify the genes that belong to the system that is involved in modulating
drug response.
2. Catalog the variation in the DNA sequences across the population.
3. Search for phenotypes associated with the sequence variation.
4. Confirm clinical relevance of the genotype-phenotype associations.


Each of these steps is nontrivial and complicated. The first involves searching DNA 
databases and perhaps using comparative genomic techniques to identify target 
genes (41,42). The second includes high-throughput experimental methods for 
detecting DNA variations, including single nucleotide polymorphisms (SNPs), the 
most common type of DNA variation (43-45), and also for associating individual 
SNPs into haplotypes (46). The third step involves the collection of molecular, 
cellular, or clinical data (reviewed below), and the final step requires clinical trials 
to prove the associations of interest and to demonstrate clinical relevance. 


¹http://www.pharmgkb.org/


Phenotype-to-Genotype Approaches


Phenotype-to-genotype approaches toward pharmacogenomic discovery are dif- 
ferent. Instead of identifying a family of genes in which to characterize genetic 
variation, investigators search for a phenotypic measure that shows significant 
variation. This measure can be a clinical measure (such as the rate of clearance 
of a drug or the peak level of the drug for a given dose), a cellular measure (the 
rate of cellular uptake of a drug or the profile of gene expression), or a molecular 
measure (the enzymatic turnover rate of an enzyme or a substrate binding constant).
In any case, the phenotypic variation draws attention first, and a search for the
genes responsible for this variation follows. The steps of a phenotype-to-genotype
approach can therefore be summarized:


1. Identify a phenotype that shows significant variation.
2. Search for genes that may explain this variation.
3. Characterize genetic variations and check for association with the phenotype.
4. Confirm proposed genetic basis for the variation and its clinical relevance.


The challenges in the first step are to identify phenotypes that are both clinically 
relevant and also measurable. The second step is the most difficult and requires 
the investigator to use any means available to identify genes that could be involved 
with the phenotypes. It may involve using animal models and comparative ge- 
nomics, DNA microarray analysis to measure changes in expression in response to 
drugs, database (literature and sequence) searches for associations between genes 
and related phenotypes, or analytic chemistry methods to identify gene products 
contributing to variation (47). The third step is similar to the second step of the 
genotype-to-phenotype process. A major challenge in this step is the large amount 
of variability in human genes that is not functionally significant, so investigators 
must focus efforts on variations that can be shown to have functional consequence. 
The final step is focused particularly on this problem of ensuring that the discovered 
genetic component really explains the phenotypic variation of interest. 
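
The association check in the third step can be illustrated with a small sketch. It assumes a hypothetical 2x2 table (variant carriers versus non-carriers, poor versus normal drug clearance) and uses a two-sided Fisher exact test, which is one common choice rather than a method prescribed here. The function name and the counts are invented for illustration; a real study would rely on a vetted statistics package and correct for multiple testing.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: variant carriers / non-carriers (hypothetical grouping).
    Columns: poor clearance / normal clearance (hypothetical phenotype).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = prob(x)
        # Count every table that is no more likely than the observed one.
        if p <= observed * (1 + 1e-9):
            p_value += p
    return p_value


# Invented counts: 8 of 10 carriers and 1 of 6 non-carriers clear the drug poorly.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p:.4f}")
```

A small p-value only flags a candidate association; the fourth step (confirmation of the genetic basis and its clinical relevance) still requires independent evidence.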

Both approaches to pharmacogenomics have strengths and weaknesses. Investi- 
gators must assess the current knowledge base for a given drug class of interest 
in order to determine whether there is enough genetic information to justify a 
genotype-to-phenotype approach, or whether there are more striking phenotypic 
data suggesting a phenotype-to-genotype approach. 


CHALLENGES FOR BIOMEDICAL INFORMATICS 
IN PHARMACOGENOMICS 


The challenges for biomedical informatics within the study of pharmacogenomics 
all follow directly from the preceding discussion. Pharmacogenomics is relatively 
new, so the current excitement derives, in part, from the great range of opportunity 
for contributions that now exists. One of the key themes in pharmacogenomics is
that the relevant informatics expertise spans molecular biology
(sequences, structures, pathways), clinical medicine (medications,
diseases, side effects), and of course pharmacology (pharmacokinetics
and pharmacodynamics). Thus it represents a new wave of informatics problems
where both basic biological and clinical information must be combined and ana- 
lyzed. Whereas previously bioinformatics focused solely on issues of relevance to 
molecular biology (sequence and structure analysis), applications are now mov- 
ing closer to the parts of clinical informatics that focus on the organization of 
clinical information, particularly for research purposes. The main challenges for 
biomedical informatics within pharmacogenomics fall into nine areas: 


1. Representing the diversity of pharmacogenomic data
2. Developing standards for data exchange
3. Integrating data from multiple data resources
4. Mining literature for knowledge
5. Using expression data to understand regulation
6. Understanding the structural basis for variability
7. Using comparative genomics
8. Managing laboratory information
9. Protecting sensitive patient information


Representing the Diversity of Pharmacogenomic Data 


One of the principal challenges for pharmacogenomics is the creation of data struc- 
tures that store relevant information in a form that is easy for computer programs to 
manipulate. There is a difference between data formats that are useful for human 
readers (journals, tables, figures) and those that are useful for computers (data 
structures in computer programs that label all data for easy retrieval and analysis). 
The classes of data that must be represented are diverse, as are the connections 
between the data that must be maintained. 


GENOMIC DATA The representation of DNA sequence information for pharma- 
cogenomics is similar to that required for many other applications. The main 
requirement is that the gene structure for a protein product be understood and 
labeled so that observed DNA sequence variations can be interpreted as be- 
longing to coding or noncoding regions, and their likely significance can be 
evaluated. The key concepts that must be modeled include genomic sequence, 
unprocessed mRNA transcript, processed transcript, and protein sequence. Within 
each of these models are details such as the 3’- and 5’-untranslated regions of 
genes, genetic regulators (enhancers, silencers), exons that are coding or non- 
coding (or partially coding), and alternative splicing strategies. There have been 
a number of proposed standards for tracking genomic data, including the data 
structures behind GenBank (48), Human Genome Database (49), BIOML², and
others. The human genome browsers offered by UC Santa Cruz³, Ensembl⁴,
National Center for Biotechnology Information (NCBI)⁵, and Celera⁶ offer a basic
look at gene structure, but these are still evolving because the genome is in draft.
In addition to the representation of basic gene structure, it is critical to also under- 
stand the locations and types of genetic variation. The dbSNP resource at NCBI 
provides an excellent source of reported SNPs (50), including those submitted by 
The SNP Consortium (51), an industrial group that is performing large-scale SNP 
detection and submitting many of these to the public domain. 
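
A labeled gene structure of the kind described above can be sketched as a small data model. This is only an illustration under simplifying assumptions: a single transcript on the forward strand, 1-based inclusive coordinates, and no regulatory elements or alternative splicing. The class names, field names, gene symbol, and coordinates are all hypothetical and do not come from GenBank or any other resource named here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int  # 1-based genomic coordinate, inclusive
    end: int    # inclusive


@dataclass
class GeneModel:
    symbol: str
    exons: list[Exon]  # sorted by position, forward strand assumed
    cds_start: int     # first coding base (genomic coordinate)
    cds_end: int       # last coding base (genomic coordinate)

    def classify(self, pos: int) -> str:
        """Label a genomic position by the part of the gene it falls in."""
        in_exon = any(e.start <= pos <= e.end for e in self.exons)
        if not in_exon:
            inside_gene = self.exons[0].start <= pos <= self.exons[-1].end
            return "intron" if inside_gene else "intergenic"
        if pos < self.cds_start:
            return "5'-UTR"
        if pos > self.cds_end:
            return "3'-UTR"
        return "coding"


# Invented gene with three exons; the first and last are only partially coding.
gene = GeneModel(
    symbol="GENE1",
    exons=[Exon(100, 200), Exon(300, 400), Exon(500, 600)],
    cds_start=150,
    cds_end=550,
)
for position in (50, 120, 250, 350, 580):
    print(position, gene.classify(position))
```

With a model like this, an observed sequence variation can be assigned to a coding or noncoding region before its likely significance is evaluated.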

Genome data are also made more useful by their connections to databases of 
biological function, including the Online Mendelian Inheritance in Man (OMIM) 
database of inherited human disorders (52, 53), and a number of specialty databases
that provide valuable in-depth information about individual gene families, such 
as the Cell Signaling Network Database (54), the transcription factor database 
TRANSFAC (55), and the protein kinase database (56). 


MOLECULAR AND CELLULAR DATA The characterization of phenotype is important
for both the genotype-to-phenotype and the phenotype-to-genotype
methods. Phenotype is difficult to define precisely, but it can be thought of as
functional features of gene products, ranging in detail from the molecular to the 
individual and population levels. Unfortunately, phenotype data are not as “digi- 
tal” as sequence data, so they are much more difficult to represent. Nevertheless, 
the success of pharmacogenomics depends on the establishment of standards for 
describing these data. 

In pharmacogenomics, a few types of molecular and cellular data are clearly 
critical to represent. These include enzyme kinetic data (such as the binding and 
catalytic constants of enzymes, and their associated kinetic parameters), three- 
dimensional structural data (when available) for enzymes and their substrates/ 
ligands, and protein localization data (often images) that show where different gene 
products are found within the cell. Standard representations of these data types are 
not generally available, but the creation of databases to store such information will 
require that they be developed. Fortunately, there is fairly good agreement on the 
basic vocabulary of enzyme kinetics and the definition of these parameters in basic 
pharmacology texts (and associated programs for computing the parameters), as 
well as the representation of three-dimensional structural data in the Cambridge 
Crystallographic Database⁷ and the Protein Data Bank (PDB)⁸.


²http://www.bioml.com/BIOML/
³http://genome.ucsc.edu/
⁴http://www.ensembl.org/genome/central/
⁵http://www.ncbi.nlm.nih.gov/genome/guide/central.html
⁶http://public.celera.com/index.cfm
⁷http://www.ccdc.cam.ac.uk/
⁸http://www.rcsb.org/pdb/

Microarray expression data are clearly going to become an important phenotypic
data source for pharmacogenomics (57-59), and standards are in active
development at this time, although they have not settled yet. There is a proposal
for a MicroArray Markup Language (MAML) that would specify a minimal set of
information for exchange. MAML is part of a larger effort to develop standards for
microarray data and databases in the Microarray Gene Expression Database⁹ effort.
The challenges here involve developing standards for representing the experimen- 
tal conditions, the quality control parameters, the list of genes being assayed, and 
the actual expression measurements (and background measurements) recorded. 


CLINICAL DATA Pharmacogenomics requires a connection to clinical medicine in 
order to establish the relevance and importance of the systems that are studied. As
such, clinical data can be considered simply another kind of phenotype, like molecular
or cellular data. However, the techniques used to collect and describe clinical data
are sufficiently different from basic biological data that the approaches should be 
distinguished. At the most basic level, pharmacogenomics information resources 
need to store clinical information pertaining to the most commonly measured 
phenotypes: one, pharmacokinetic profiles of drug levels in response to dosing and 
two, measures of pharmacodynamic efficacy based on the target effects. In addition, 
the incidence of side effects in response to medications must be represented. 

Although there is a large literature on the representation of clinical data, much 
of this is for the purposes of supporting the delivery of clinical care and not 
for clinical research. Clinical research requires precision in ways different from 
clinical care, so the portability of most clinical data standards is not clear. The 
standards that exist for coding diagnoses (International Classification of Diseases 
standard¹⁰), pathology (Systematized Nomenclature of Medicine¹¹), procedures
(Current Procedural Terminology¹²), and others offer a good starting point for
pharmacogenomics research but in general do not provide the precision required 
for high-quality data storage. 

As is the case for enzyme kinetics, there is a fair amount of uniformity within the 
pharmacology community on how to represent pharmacokinetic profiles. Programs 
such as ADAPT II¹³, NONMEM¹⁴, SAAM 30, and CONSAM¹⁵ all
allow first- and second-order kinetics and associated parameters to be computed.
A standard set of parameters, including Ki, Km, and Vmax, has fairly consistent
definitions and thus provides a good initial opportunity for modeling of the data.
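
To make the role of these parameters concrete, the sketch below evaluates the textbook Michaelis-Menten rate law, with Ki entering through the standard competitive-inhibition form. The equations are the usual textbook ones, not taken from any of the programs named above, and the function name, argument names, and numeric values are assumptions chosen for illustration.

```python
import math
from typing import Optional


def michaelis_menten_rate(
    substrate: float,
    vmax: float,
    km: float,
    inhibitor: float = 0.0,
    ki: Optional[float] = None,
) -> float:
    """Reaction rate v = Vmax * [S] / (Km_app + [S]).

    With a competitive inhibitor, Km_app = Km * (1 + [I] / Ki);
    without one, Km_app is simply Km.
    """
    apparent_km = km if ki is None else km * (1.0 + inhibitor / ki)
    return vmax * substrate / (apparent_km + substrate)


# When [S] equals Km, the uninhibited rate is half of Vmax.
half_max = michaelis_menten_rate(substrate=5.0, vmax=10.0, km=5.0)
assert math.isclose(half_max, 5.0)

# An inhibitor at a concentration equal to Ki doubles the apparent Km.
inhibited = michaelis_menten_rate(substrate=5.0, vmax=10.0, km=5.0, inhibitor=2.0, ki=2.0)
print(half_max, inhibited)
```

Because the definitions are this uniform, a database can store Ki, Km, and Vmax as plain numeric fields with units, and different groups can interpret them in the same way.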

⁹http://www.mged.org/
¹⁰http://www.cdc.gov/nchs/about/otheract/icd9/abticd10.htm
¹¹http://www.snomed.org/
¹²http://www.ama-assn.org/ama/pub/category/3113.html
¹³http://www.usc.edu/dept/biomed/BMSR/Software/adptmenu.html
¹⁴http://c255.ucsf.edu/nonmem0.html
¹⁵http://www-saam.nci.nih.gov/index.html

One issue that arises in modeling phenotypic data is the relationship between
raw data and the more processed parameters and intermediate representations.
Although the raw data are used to compute these intermediate representations (for
example, the raw time points of blood levels are used to compute pharmacokinetic
parameters), it can be difficult to determine the appropriate level of data to make
routinely available in databases. The raw data can be cumbersome, and the computed
parameters may be all that is of real interest to most users. However, there are times
when the raw data must be retrieved in order to check conclusions or explore
alternative interpretations. Thus, a major challenge for pharmacogenomic information
resources is to provide easy access both to the computed/derived parameters and to
the basic information upon which they are based.
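
One way to keep both levels accessible is to store the raw time course and compute the derived parameters from it on request. The sketch below is a minimal illustration of that idea, assuming a hypothetical record layout; the class name, field names, units, and sample values are invented, and the area under the curve is estimated with the simple trapezoid rule rather than with any of the modeling programs mentioned earlier.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PKProfile:
    """Raw blood-level time course for one subject and one dose (hypothetical layout)."""

    times_h: list[float]        # sampling times in hours, ascending
    conc_mg_per_l: list[float]  # measured concentrations at those times

    def auc_trapezoid(self) -> float:
        """Area under the concentration-time curve by the trapezoid rule."""
        points = list(zip(self.times_h, self.conc_mg_per_l))
        return sum(
            (t2 - t1) * (c1 + c2) / 2.0
            for (t1, c1), (t2, c2) in zip(points, points[1:])
        )

    def derived(self) -> dict[str, float]:
        """Derived parameters, always recomputable from the stored raw data."""
        return {
            "auc_mg_h_per_l": self.auc_trapezoid(),
            "cmax_mg_per_l": max(self.conc_mg_per_l),
        }


# Invented measurements for illustration only.
profile = PKProfile(times_h=[0.0, 1.0, 2.0, 4.0], conc_mg_per_l=[0.0, 4.0, 3.0, 1.0])
print(profile.derived())
```

Most users would read only the output of `derived()`, while the raw lists remain available to anyone who wants to refit the data with a different model.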


Developing Communication Standards in Pharmacogenomics 


One way in which informatics technologies can help accelerate progress in a field 
is the development of standards for representing and exchanging data. It is clear 
that shared understanding of the basic data elements within pharmacogenomics is a 
critical building block with which to build an information infrastructure. Methods 
for communicating these data are therefore equally important. The two main ar- 
eas that require progress are the definition of shared syntax (how information is 
structured in a data file) and semantics (how the information should be interpreted 
by others). Two contributing technologies within informatics that address these 
problems are technologies for defining shared vocabularies and technologies for 
exchanging them. 

A standard vocabulary is a controlled set of terms that can be used instead 
of free text to communicate information. For example, whereas an abstract in
MEDLINE is free text, the list of Medical Subject Headings (MeSH) keywords from the
MEDLINE database¹⁶ represents a controlled vocabulary. The advantage of a
controlled vocabulary is that computer programs can be written to expect cer- 
tain phrases and can be instructed how to process data based on the occurrence 
of these phrases. Whereas free text has much more power for expressiveness, 
it is difficult for computer programs to understand because of the inherent 
ambiguities in human-to-human communication. Some of these difficulties are 
addressed by natural language processing techniques (reviewed below). However, 
a principal means to address these problems is to develop and adopt controlled 
vocabularies. 
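
The benefit for programs can be seen in a toy example that maps free-text phrases onto preferred terms. The vocabulary below is invented for illustration and is not drawn from MeSH or any other real terminology; the point is only that once text has been reduced to a controlled term, a program can branch on it reliably.

```python
from typing import Optional

# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
CONTROLLED_TERMS = {
    "nausea": {"nausea", "feeling sick", "queasiness"},
    "rash": {"rash", "skin eruption"},
}

# Reverse index from every accepted phrase to its preferred term.
SYNONYM_INDEX = {
    synonym: preferred
    for preferred, synonyms in CONTROLLED_TERMS.items()
    for synonym in synonyms
}


def normalize(free_text: str) -> Optional[str]:
    """Return the preferred term for a phrase, or None if it is not in the vocabulary."""
    return SYNONYM_INDEX.get(free_text.strip().lower())


print(normalize("Feeling sick"))
print(normalize("headache"))
```

Phrases that fall outside the vocabulary return `None`, which is where natural language processing or manual curation would have to take over.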

Some vocabularies have been developed to facilitate the delivery of clinical care, 
and these may be helpful in pharmacogenomics. Many contributing vocabularies 
have been related to one another as part of the Unified Medical Language System 
project at the National Library of Medicine (60). To support these endeavors fully, 
however, these vocabularies need to be supplemented by establishing the following 
standards. 


¹⁶http://www.ncbi.nlm.nih.gov/PubMed/


1. Human gene names and links to other organisms. The Human Genome
Nomenclature Committee has created a reference set of symbols for human
genes, and this should stabilize over time and provide a useful set of indices
for the human genome browsers. Included in this activity is the identification
of the function of new genes of pharmacogenomic interest, particularly
transporters [which are classified in a taxonomy by Saier¹⁷ (61)] and the
cytochrome P450 system (classified based on isoform similarity) (62, 63).


2. Drug and compound names. Efforts have been proposed to build a controlled
list of drug categories, their structural and biological features, and
the associated specific compounds. The Unified Medical Language System
(UMLS) contains the 1997 Food and Drug Administration Standard Product
Nomenclature.¹⁸


3. Side effects. Standards are required for coding drug side effects, at a clinical
level and perhaps at a lower biological level. A vocabulary that is used for
clinical trials is a good starting point. The UMLS contains the World Health
Organization Adverse Drug Reaction Terminology and the Coding Symbols
for a Thesaurus of Adverse Reaction Terms (COSTART) from the Food and
Drug Administration.¹⁹


Data Exchange Standards 


eXtensible Markup Language (XML) has emerged as a common standard for the
exchange of data (64)²⁰. XML is a syntax for specifying how text data can be labeled
so that computer programs can load the data items into their memory structures.
Without a standard such as XML, competing file formats would abound,
and programs would work only with a subset of these formats. It would then become
very difficult to exchange data and run programs. Similarly, databases often are
constructed independently and organize their data quite differently (including
decisions as simple as whether a drug dose should be specified as a single string
“20 mg oral” or as separate fields). XML provides a partial solution to this problem
by providing a standard syntax for specifying the elements within the file
and how they are presented. It becomes relatively easy, then, to read files in one
format and then to translate them into a newer format. Even better, however, is for
the community to adopt a single format for uniform representation of data. The
PharmGKB database²¹ is attempting to define some XML standards for data that
will allow basic pharmacogenetic data to be formatted in a standard manner.
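
The dose example can be made concrete with Python's standard `xml.etree.ElementTree` module. The sketch below splits the single string "20 mg oral" into labeled fields and serializes them as XML. The element names (`dose`, `amount`, `unit`, `route`) are hypothetical and are not taken from the PharmGKB schemas or any other published format.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout of the free-text dose: "<number> <unit> <route>".
DOSE_PATTERN = re.compile(
    r"^\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\s+(?P<route>[A-Za-z]+)\s*$"
)


def dose_to_xml(dose_text: str) -> ET.Element:
    """Turn a dose string such as '20 mg oral' into a labeled XML element."""
    match = DOSE_PATTERN.match(dose_text)
    if match is None:
        raise ValueError(f"unrecognized dose format: {dose_text!r}")
    dose = ET.Element("dose")
    for field_name in ("amount", "unit", "route"):
        ET.SubElement(dose, field_name).text = match[field_name]
    return dose


xml_text = ET.tostring(dose_to_xml("20 mg oral"), encoding="unicode")
print(xml_text)
# <dose><amount>20</amount><unit>mg</unit><route>oral</route></dose>

# Any XML-aware program can now read the fields back without custom parsing.
print(ET.fromstring(xml_text).findtext("unit"))
```

The labeling solves only the syntactic half of the problem: two groups could still choose different element names or units for the same concept.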

¹⁷http://www-biology.ucsd.edu/~msaier/transport/titlepage.html
¹⁸http://www.fda.gov/cder/ndc/database/default.htm and http://www.fda.gov/cder/ob/default.htm
¹⁹http://www.fda.gov/cder/aers/index.htm
²⁰http://www.w3.org/XML/
²¹http://pharmgkb.stanford.edu/xml-schemas.html

A more difficult issue is the semantics of the data representation. Syntax only
enforces the order of the elements and their basic types (integers, real numbers,
strings, and the like), but specifications of semantics require more constraints on the
logical relationships between data items (for example, a “nonsynonymous SNP”
must be at a genome sequence position that is a SNP, must be within a coding
region, and must change the amino acid that is encoded). The specification of
semantics is an active area of research, but the Resource Description Framework
(written in XML itself²²) is an attempt to enable the standardization of semantics.
In addition, knowledge base management systems support the logic and constraint
checking that is required for computationally enforcing semantics (65, 66).
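
The three constraints in the nonsynonymous SNP example can be written as an explicit check. The sketch below is a simplified illustration, not a knowledge base system: it assumes an in-frame coding sequence on the forward strand, the standard genetic code, and a 0-based offset into that sequence, and the function name and test sequence are invented.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def is_nonsynonymous_snp(cds: str, offset: int, alt: str) -> bool:
    """Check the three semantic constraints for a proposed nonsynonymous SNP.

    cds    -- coding sequence, in frame, forward strand (assumption)
    offset -- 0-based position of the variant within cds
    alt    -- the alternative base observed at that position
    """
    # Constraint 1: the position must lie within the coding region (complete codons only).
    if not 0 <= offset < len(cds) - len(cds) % 3:
        return False
    # Constraint 2: it must be a SNP, i.e. a single base that differs from the reference.
    if len(alt) != 1 or alt not in BASES or alt == cds[offset]:
        return False
    # Constraint 3: the substitution must change the encoded amino acid.
    codon_start = offset - offset % 3
    ref_codon = cds[codon_start:codon_start + 3]
    within = offset % 3
    alt_codon = ref_codon[:within] + alt + ref_codon[within + 1:]
    return CODON_TABLE[ref_codon] != CODON_TABLE[alt_codon]


cds = "ATGGAAGTT"  # invented sequence encoding Met-Glu-Val
print(is_nonsynonymous_snp(cds, 4, "T"))  # GAA -> GTA (Glu -> Val): True
print(is_nonsynonymous_snp(cds, 5, "G"))  # GAA -> GAG (Glu -> Glu): False
```

A data format can state that a field holds a "nonsynonymous SNP", but only a check of this kind, run against the sequence data, can confirm that the label is used consistently.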


Integrating Data From Diverse and Heterogeneous Databases 


Pharmacogenomics research is marked by the diversity of databases that must 
be used in order to answer important questions. For example, in order to find 
all three-dimensional protein structures with SNPs that change the amino acid 
in the coding region of proteins that are involved in diabetes, we must combine 
gene sequence data [such as are found in GenBank (48)], SNP data [dbSNP
(50)], three-dimensional structural databases [Protein Data Bank (67)], databases 
of genetic diseases and their gene defects [OMIM (52)], and the medical literature. 
Other queries might require databases of drugs or drug-drug interactions, which 
are not publicly available at this time. 

The problem of integrating data is a difficult one within computer science. 
One approach is to create a single large data model and to dump the contents 
of all the contributing databases into the new “mega” database. This approach, 
called consolidation, suffers because as the contributing databases evolve, the 
consolidated database becomes out of date. In addition, it is very difficult to build a
large data model. Another class of approaches to database integration is federation, 
which can take three forms. In the first, databases are linked together loosely with 
hyperlinks on the web, offering little help to automatic programs but useful for 
human users. In the second, programs are written to extract certain types of data 
from each database and combine them to answer queries that require data from 
more than one database (68). In the third, programs are written to extract the data 
on a regular basis from the contributing databases and dump them into a common 
database that is then updated regularly [but functions as a consolidated database 
between updates (69)]. 
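
The second form of federation, in which a program pulls selected data from each source and combines them to answer one query, can be sketched as follows. The three in-memory structures stand in for independent databases of SNPs, structures, and disease associations; every identifier and record in them is invented, and a real mediator would issue queries to the remote resources instead.

```python
from __future__ import annotations

# Stand-ins for three independently maintained databases (all contents invented).
SNP_SOURCE = [
    {"snp_id": "snpA", "gene": "GENE1", "changes_amino_acid": True},
    {"snp_id": "snpB", "gene": "GENE2", "changes_amino_acid": False},
    {"snp_id": "snpC", "gene": "GENE3", "changes_amino_acid": True},
]
STRUCTURE_SOURCE = {"GENE1": ["struct1"], "GENE3": []}
DISEASE_SOURCE = {"GENE1": ["diabetes"], "GENE2": ["diabetes"], "GENE3": ["asthma"]}


def structures_with_coding_snps(disease: str) -> list[tuple[str, str, str]]:
    """Combine the three sources to answer a single cross-database query.

    Returns (gene, snp_id, structure_id) for every amino-acid-changing SNP
    in a gene that is linked to the disease and has a known 3D structure.
    """
    results = []
    for snp in SNP_SOURCE:
        gene = snp["gene"]
        if not snp["changes_amino_acid"]:
            continue
        if disease not in DISEASE_SOURCE.get(gene, []):
            continue
        for structure_id in STRUCTURE_SOURCE.get(gene, []):
            results.append((gene, snp["snp_id"], structure_id))
    return results


print(structures_with_coding_snps("diabetes"))
```

Because each source is consulted at query time, the answer reflects the current contents of every database, at the cost of writing and maintaining one extraction routine per source.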


Mining the Published Literature for Pharmacogenomic Data 


The Medline/PubMED resource contains references to more than ten million 
biomedical publications, and many of these offer online abstracts. In addition, 
many journals are now making their articles available online in full text. Although 
this mode of publication is effective for human users, it is difficult for comput- 
ers to extract information from natural language text. At the same time, there are 
decades of biological knowledge stored in written natural language text. In order
to avoid the loss of this information and to assist in the automatic population of
databases, informatics researchers are building systems for extracting information
from text. The general problem of understanding the full details of a natural language
text has been studied for more than four decades and remains unsolved (70);
however, the more tractable goal of reliably identifying relationships within text
is within reach. For example, texts can be analyzed to extract protein names (71),
and protein-protein (11, 72, 73) or protein-drug interactions (74), based on the
occurrence of protein names and verbs such as “inhibits,” “activates,” “represses,”
and “enhances.”

2 http://www.w3.org/RDF/

Within pharmacogenomics there are good opportunities for natural language 
processing (NLP) techniques to assist in the organization of data. First, there is no 
definitive list of drug-gene interactions, and the literature (both published medical 
literature and the U.S. patent application literature) is filled with associations
that are pharmacokinetic (e.g., “X is metabolized by CYP2D6”) and pharmacodynamic
(e.g., “Y is active at the beta-adrenergic receptor”). In general, NLP
techniques work best in well-defined domains that use standardized vocabulary. A 
second area of opportunity within pharmacogenomics is the extraction of cellular 
localization information (“X is localized to the Golgi”) from text (75). A third 
area that holds much promise is the processing of mRNA expression data with 
microarrays (8, 20, 57, 59, 76-79). Because of the great volume of information
generated by these experiments, it is often useful to cluster the genes based on 
expression pattern, and it is very challenging subsequently to summarize the key 
features of each cluster. NLP-based techniques may be able to combine the pub- 
lished information about genes with information about how they cluster to create 
automatic cluster labels that provide biological insight. A fourth area for NLP ap- 
plications is in the identification of genes of pharmacogenomic interest from text 
and in the classification of these genes as primarily of pharmacokinetic or phar- 
macodynamic importance. The language within pharmacokinetic papers is quite 
idiosyncratic (including discussions of “area under curve” and “bioavailability”),
so it may be relatively straightforward to classify abstracts that are discussing this 
topic and then to extract key data elements from them. 
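
The verb-based relationship extraction and the vocabulary-based classification of pharmacokinetic abstracts described in this section can be illustrated with a deliberately simple sketch. It assumes that a list of entity names is already available and that the relationship is stated as "name verb name"; published systems are considerably more sophisticated. The function names and the example entity names are hypothetical; the verbs and pharmacokinetic terms are the ones quoted in the text.

```python
import re

# Verbs and pharmacokinetic phrases taken from the examples in the text.
INTERACTION_VERBS = ("inhibits", "activates", "represses", "enhances")
PK_TERMS = ("area under curve", "bioavailability")


def extract_interactions(text, known_names):
    """Return (subject, verb, object) triples where two known names flank a verb."""
    if not known_names:
        return []
    # Longest names first so that a short name cannot shadow a longer one.
    names = "|".join(re.escape(n) for n in sorted(known_names, key=len, reverse=True))
    verbs = "|".join(INTERACTION_VERBS)
    pattern = re.compile(rf"\b({names})\s+({verbs})\s+({names})\b")
    return [match.groups() for match in pattern.finditer(text)]


def looks_pharmacokinetic(abstract):
    """Crude keyword test for pharmacokinetic vocabulary in an abstract."""
    lowered = abstract.lower()
    return any(term in lowered for term in PK_TERMS)
```

A pattern this rigid misses passive constructions, negation, and any sentence in which other words separate the names from the verb, which is one reason such techniques work best in well-defined domains with standardized vocabulary.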


Using Expression Data to Assess the 
Phenotypes of Drug Response 


The emergence of DNA microarrays to measure mRNA expression has created 
excitement in many areas of biomedical research, including pharmacogenomics 
(8, 20, 57, 59, 76-79). These microarrays hybridize amplified RNA from
samples of interest to DNA of known sequence (affixed to small
spots that are arranged into a square array) in order to measure the level of gene
expression in the samples. The use of microarrays for pharmacogenomics has only
begun, but it has the potential to bring great gains because microarrays can be used to
address the most difficult steps of both genotype-to-phenotype and phenotype-to-genotype
approaches. In genotype-to-phenotype investigations, microarrays can
assist in the third step to find phenotypes at the cellular and molecular level that 
are associated with variations in genotype (and in the context of administering 
certain drugs). In phenotype-to-genotype studies, microarray measurements can 
be used in the second step to find genes whose expression alters in the context of 
an important new phenotype. 

The most common informatics analyses of microarray data are currently clus- 
tering of genes and classification of genes based on shared expression patterns 
(80). These groupings can be used as evidence for “guilt-by-association” assign- 
ments of function, whereby the function of a gene is assumed to be similar to the 
function of genes with which it is grouped. The most exciting work in microarrays, 
however, involves combining information from microarrays with other data sets. 
One important study measured the expression levels of genes within sixty cancer 
cell lines and compared the sensitivity of each of these cell lines to over 70,000 
different potential anticancer drugs (58). The key expression features that identi- 
fied potential sensitivity to the anticancer drugs were defined, and cell lines were 
clustered based on common potential sensitivities. Microarray analysis has also
been used to study the pharmacogenomics of cystic fibrosis (81), schizophrenia
(82), and other conditions. We expect that in the future microarray analysis of cells before
and after drug exposure will provide an important set of pharmacogenomic data 
for determining the full set of changes that occur at a cellular level, as well as for 
determining the cells’ kinetics (59). 
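
The “guilt-by-association” idea can be sketched in a few lines. The example below is not a full clustering algorithm: it simply assigns an unannotated gene the function of the annotated gene whose expression profile is most strongly correlated with it, using the Pearson correlation coefficient as an assumed similarity measure. The gene names, expression values, and function labels are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no information about co-expression
    return cov / (sd_x * sd_y)


def guilt_by_association(profiles, annotations, gene):
    """Suggest a function for `gene` from its most correlated annotated gene.

    Returns (function, neighbor, correlation), or None if no annotated gene exists.
    """
    best_gene, best_r = None, -2.0
    for other, profile in profiles.items():
        if other == gene or other not in annotations:
            continue
        r = pearson(profiles[gene], profile)
        if r > best_r:
            best_gene, best_r = other, r
    if best_gene is None:
        return None
    return annotations[best_gene], best_gene, best_r


# Hypothetical expression levels across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "unknown": [1.1, 2.0, 3.2, 3.9],
}
annotations = {"g1": "drug transport", "g3": "DNA repair"}
suggestion = guilt_by_association(profiles, annotations, "unknown")
```

Any function suggested this way is a hypothesis to be tested rather than an established annotation, since co-expression does not guarantee shared function.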


Understanding the Structural Consequences 
of Genetic Variations 


Pharmacology has always had a strong structural component because the three- 
dimensional structure of drugs can be critical for understanding mechanisms of ac- 
tion and for building pharmacophores for drug design (83). The increasing number 
of structures in the 3D structural database also allows us to model the interactions 
between proteins and their ligands in order to gain a high-resolution understanding 
of drug action. The PDB now has over 15,000 individual structures (67), and the 
emergence of high-throughput structure determination efforts promises to main- 
tain a rapid rate of data acquisition (84). The availability of more 3D structures 
makes it increasingly likely that a homologous protein of known structure will be 
available for most proteins of pharmacogenomic interest. The types of structural 
analyses that are becoming important include the following methods:


1. Methods for docking small molecules into binding pockets of proteins in 
order to predict affinity. There have been a number of successful reports in 
this area, including algorithms based on energetics and based on statistical 
analysis of pockets (85-87). 


2. Methods for homology modeling in order to build models of protein varia- 
tions. A variety of programs have been developed and tested and offer good 
options for building 3D models of proteins that are globular and share 30%
or more sequence identity with a known structure (88). In these applica- 
tions, the location and possible functional significance of nonsynonymous 
SNPs in the coding portion of proteins can be evaluated (89). A recent paper 
estimated that 30% of all nonsynonymous SNPs may be associated with 
significant changes in function (90) based on an analysis of a large set of 
mutations in DNA binding proteins. There are also some early indications 
that even synonymous SNPs may change RNA stability and affect the level 
of activity for some proteins. 


3. Methods for predicting protein-protein interactions. It is clear that many 
proteins have multiple partners with which they interact as activators, in- 
hibitors, or otherwise as modifiers. There has been progress in molecular 
docking algorithms (91-96) that allows investigators to combine geometric
and energetic properties for the purpose of understanding how two protein 
surfaces may interact. 
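
Before the structural significance of a coding SNP can be evaluated, it must first be classified as synonymous or nonsynonymous. The sketch below shows that preliminary step only, using the standard genetic code. It assumes an in-frame coding sequence on the sense strand and a 0-based position; the function name and the example sequence are hypothetical, and the structural modeling itself is not attempted.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(coding_seq, position, alt_base):
    """Classify a single-base substitution in an in-frame coding sequence.

    `position` is 0-based. Returns (kind, reference_aa, alternate_aa), where a
    change to a stop codon ("*") is reported as nonsynonymous.
    """
    offset = position % 3
    codon_start = position - offset
    ref_codon = coding_seq[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


# Hypothetical coding fragment ATG GAA GTT (Met-Glu-Val).
silent = classify_snp("ATGGAAGTT", 5, "G")    # GAA -> GAG
missense = classify_snp("ATGGAAGTT", 4, "T")  # GAA -> GTA
```

The residue number of a nonsynonymous change (here `position // 3`) is what would then be located on a known or homology-modeled structure.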


Comparing Genomes to Develop Pharmacogenomic Models 


Comparative genomics is the study of multiple genomes from different organ- 
isms in order to define the shared characteristics between organisms as well as the 
distinguishing characteristics of each organism (41, 42). For pharmacogenomics,
comparative genomics can identify analogs to human drug response phenotypes in 
organisms that are more readily manipulated experimentally. In general, the coding 
regions of rat, mouse, and pig are more strongly conserved relative to human than
the noncoding regions. Recent work shows, however, that noncoding regions can also be
conserved across species and can be important for regulating gene expression (97).
Thus, the availability of complete genomes for related species is likely to assist 
pharmacogenomics researchers in identifying areas outside coding regions that 
may be worth studying for polymorphisms. Because the background rate of poly- 
morphisms in humans is so high, it is critical to have these clues from comparative 
genomics to guide the selection of regions to emphasize in the search for critical 
variations. These analyses depend on accurate alignments of large segments of 
genomic DNA, and special purpose algorithms have been developed (97, 98).

Comparative genomics techniques can also be used to understand metabolic
and genetic regulatory pathways. Metabolic databases such as EcoCYC (99) and 
the Kyoto Encyclopedia of Genes and Genomes (KEGG) (100) have been con- 
structed for a number of complete bacterial genomes and are beginning to emerge 
for eucaryotes as well. These resources will allow a computational analysis of 
pathways in determining genes that should be studied and perhaps in determining 
their role in metabolism or drug action (101). Genetic regulatory pathways are 
summarized in the Cell Signaling Network Database (54) and the Signal Transduction
Knowledge Environment (STKE), and can also be studied with databases
of transcription factors and regulatory sequences (55). The use of comparative
microarray expression analysis is also being evaluated as a strategy for filtering
important signals from microarray expression experiments (102, 103). 
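
The use of cross-species conservation to select regions worth studying for polymorphisms, discussed earlier in this section, can be sketched as a windowed identity scan. The example assumes that the two genomic segments have already been aligned (with "-" marking gaps) by one of the special purpose alignment algorithms; the alignment step itself is not shown. The function names, window size, and threshold are hypothetical choices for illustration.

```python
def window_identity(aligned_a, aligned_b, window):
    """Fraction of identical, non-gap columns in each non-overlapping window.

    Returns a list of (window_start, identity) pairs. A trailing partial
    window is ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")
    scores = []
    for start in range(0, len(aligned_a) - window + 1, window):
        columns = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for a, b in columns if a == b and a != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(aligned_a, aligned_b, window, threshold):
    """Start positions of windows whose identity meets the threshold."""
    return [
        start
        for start, identity in window_identity(aligned_a, aligned_b, window)
        if identity >= threshold
    ]


# Two invented, pre-aligned fragments of equal length.
human = "ACGTACGTAAAA"
mouse = "ACGTACGATTTT"
candidates = conserved_windows(human, mouse, window=4, threshold=0.75)
```

Windows that pass the threshold but lie outside annotated coding regions are the ones that such an analysis would flag for closer inspection.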


Managing Laboratory Information Data 


Although it is not peculiar to pharmacogenomics, the development of reliable lab- 
oratory information management systems (LIMS) is as critical to this field as any. 
Tracking the large number of samples from patients, tissues,
cell lines, and individual genes and gene products is a nontrivial bookkeeping
challenge. Excellent systems have been developed in the context of disease re- 
search networks for web-based tracking and linking of samples to core data sets 
(104). Pharmacogenomics offers a particularly challenging application area for 
LIMS because of the diverse array of data and samples that are relevant and be- 
cause genomic data are being linked to clinical data for the purposes of finding 
genotype-phenotype associations. In addition, the emergence of publicly available 
tissue samples for sampling genomic diversity (105) creates a tracking problem 
for samples worldwide. 


Protecting the Confidentiality and Privacy 
of Clinical Phenotype Data 


The study of pharmacogenomics and the desire to disseminate data pertaining 
to pharmacogenomics raise a number of important issues in ethics and patient 
confidentiality. The need to integrate molecular data with clinical data implies 
that clinical phenotype information may be distributed on the internet. It is cru- 
cial, however, that the privacy and security of patient identity be maintained 
while disseminating these data. Simple methods of “de-identification” (in which
basic identifying information such as name, address, and other demographic
information is removed) are not adequate for patient protection. As more information is
provided about a patient, even though it is not directly identifying, it can often be 
combined with other data sources (such as hospital discharge records and voter and 
driver registration information) to reconstruct, either exactly or probabilistically, 
the identity of patients (106). There are precedents for the publication of patient- 
related data in journals as well as in databases (such as Genbank). Nonetheless, 
it is critical to ensure that the availability of clinical phenotype data sets does not 
lead to the loss of study-subject confidentiality or privacy. 

There are generally two approaches to protecting patient privacy. The first, 
called mediation, inserts a computer program in between a user and a database 
and monitors the queries that are asked and the answers that are provided by 
the database. The mediator has rules about the kinds of queries that can be writ- 
ten and the kinds of responses that can be supplied from the database. It stops 
queries that are inappropriate (based on the rules, which often include infor- 
mation about the privileges of the user), and it filters results that are inappro- 
priate. These mediators have been shown to provide reasonable semi-automated 
protections (107). A second method, called scrubbing, is based on the principle of 
removing information from a data set so that the details of the data cannot be used 
to re-identify a patient (108, 109). Thus, for example, a list of pharmacokinetic
parameters can be either rounded off to reduce precision or can be provided as 
ranks instead of absolute values. Each of these maneuvers would serve to increase 
the pool of subjects from whom these values could come and thus protect the pri- 
vacy of individual subjects. Crucial to scrubbing is the idea of “bin size,” which is 
the minimum number of people (in some defined population) who match a certain 
query. Thus, we may say that in a study with 500 patients, we will never answer a
query with results containing fewer than 10 subjects, so that individual subject data
cannot be teased out. The bin size has been used by other organizations, such as
the Social Security Administration in the United States and the U.S. Census Bureau
(106). The obvious disadvantage to scrubbing is the loss of precision, which can 
make certain statistical analyses much more expensive. Although not within the 
scope of this review, there are associated implications about how study subjects 
should exercise informed consent when participating in pharmacogenomic stud- 
ies. The other critical issue that is outside the technical scope of this paper, but 
deserves mention, is the problem of having pharmacogenomic information used 
to discriminate (either in terms of the research agenda, insurance, or employment) 
against groups within the population based on statistical associations. 
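
The scrubbing maneuvers and the bin-size rule described above can be sketched as follows. The three functions are hypothetical illustrations: one rounds values to reduce precision, one replaces values with ranks, and one refuses to answer a query that matches fewer subjects than the minimum bin size (10, following the example in the text). The last function also hints at how a mediator might filter responses, although a real mediator would apply a much richer set of rules.

```python
MIN_BIN_SIZE = 10  # example threshold from the text; a real policy would set its own


def scrub_value(value, step):
    """Reduce precision by rounding to the nearest multiple of `step`."""
    return round(value / step) * step


def to_ranks(values):
    """Replace absolute values with their ranks (1 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def answer_query(records, predicate, min_bin_size=MIN_BIN_SIZE):
    """Return the number of matching subjects, or None if the bin is too small."""
    matches = [record for record in records if predicate(record)]
    if len(matches) < min_bin_size:
        return None  # answering could help single out individual subjects
    return len(matches)


# Hypothetical study of 500 subjects, each represented only by an index.
subjects = list(range(500))
refused = answer_query(subjects, lambda s: s < 5)    # only 5 subjects match
allowed = answer_query(subjects, lambda s: s < 50)   # 50 subjects match
```

Both rounding and ranking enlarge the pool of subjects consistent with a released value, which is the source of the loss of precision noted above.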


PHARMACOGENOMICS: A NEW CHALLENGE 
FOR BIOMEDICAL INFORMATICS 


The focus of biomedical informatics has, in the past, been fragmented into in- 
formatics to meet the challenges of genomics (sequence analysis, structure anal- 
ysis, biochemical and regulatory pathway analysis) and informatics to meet the 
challenges of organizing clinical data (medical records, information extraction, 
database integration). In the post-genome period, there will be an increasing num- 
ber of applications that require the combination of basic bioinformatics with clini- 
cal informatics. Pharmacogenomics is an excellent example of such an application 
area. Many different branches of biomedical informatics clearly will play a critical 
role in gathering, organizing, and analyzing pharmacogenomic data. The National 
Institutes of Health has recently formed a Pharmacogenetics Research Network
and Database program whereby several research groups are cooperating to gather
pharmacogenomic data (genomic, molecular, cellular and clinical) and deposit it
in a common database for public use. The database, PharmGKB, is intended to
gather data not just from these groups, but from all groups worldwide wishing 
to disseminate their data of pharmacogenomic relevance. The initial focus is on 
building representations and XML standards for the submission of basic genomic 
variation data, enzyme kinetic data, and clinical pharmacokinetic data, but work 
is being done in all the areas reviewed here. As the basic infrastructure is created, 
there will be opportunities for the community to create and distribute informatics
methodologies that address each of the challenges outlined here. There is no doubt
that other, perhaps unexpected, informatics challenges will arise in the course of
creating these resources.


Visit the Annual Reviews home page at www.AnnualReviews.org 